Toward a realistic model of speech processing in the brain with self-supervised learning

Neural Information Processing Systems

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and/or (4) implausibly large memory (e.g.


Spoken Conversational Agents with Large Language Models

Yang, Chao-Han Huck, Stolcke, Andreas, Heck, Larry

arXiv.org Artificial Intelligence

Building on this, we examine joint text-speech pre-training methods (Chiu et al., 2022; Barrault et al., 2023; Chen et al., 2022) and provide a comprehensive look at state-of-the-art voice-interfaced LLMs (Reid et al., 2024; Chu et al.). Current work in AI virtual assistants builds upon the voice-only systems of the last decade by leveraging LLMs to significantly improve the coverage and robustness of the spoken language understanding and dialogue state tracking components, alongside substantial advances in spoken language generation. The survey highlights recent advancements in multi-turn dialogue systems, encompassing both LLM-based open-domain dialogue (ODD) and task-oriented dialogue (TOD) systems, as well as relevant datasets and evaluation metrics.



WavShape: Information-Theoretic Speech Representation Learning for Fair and Privacy-Aware Audio Processing

Baser, Oguzhan, Tanriverdi, Ahmet Ege, Kale, Kaan, Chinchali, Sandeep P., Vishwanath, Sriram

arXiv.org Artificial Intelligence

Speech embeddings often retain sensitive attributes such as speaker identity, accent, or demographic information, posing risks in biased model training and privacy leakage. We propose WavShape, an information-theoretic speech representation learning framework that optimizes embeddings for fairness and privacy while preserving task-relevant information. We leverage mutual information (MI) estimation using the Donsker-Varadhan formulation to guide an MI-based encoder that systematically filters sensitive attributes while maintaining speech content essential for downstream tasks. Experimental results on three known datasets show that WavShape reduces MI between embeddings and sensitive attributes by up to 81% while retaining 97% of task-relevant information.
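As a concrete illustration of the kind of estimator the abstract refers to, the sketch below implements the Donsker-Varadhan lower bound on mutual information with a small critic network, in the style of MINE. It is a minimal, assumption-laden example rather than WavShape's code: the Critic class, the dv_mi_lower_bound helper, and all dimensions are hypothetical.

import torch
import torch.nn as nn

class Critic(nn.Module):
    # Scores (embedding, attribute) pairs; trained so true joint pairs score higher.
    def __init__(self, emb_dim, attr_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

def dv_mi_lower_bound(critic, z, a):
    # Donsker-Varadhan bound: I(Z; A) >= E_joint[T(z, a)] - log E_marg[exp(T(z, a))].
    # z: speech embeddings (B, emb_dim); a: sensitive-attribute features (B, attr_dim).
    t_joint = critic(z, a)                      # critic scores under the joint p(z, a)
    a_shuffled = a[torch.randperm(a.size(0))]   # shuffling breaks pairing -> product of marginals
    t_marg = critic(z, a_shuffled)
    n = torch.tensor(float(t_marg.numel()))
    log_mean_exp = torch.logsumexp(t_marg, dim=0) - torch.log(n)
    return t_joint.mean() - log_mean_exp

Maximizing this bound over the critic yields an MI estimate; an encoder can then be trained to push the estimate down for sensitive attributes while a task loss preserves useful content, which is the general recipe the abstract describes.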




The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Emezue, Chris, Community, NaijaVoices, Awobade, Busayo, Owodunni, Abraham, Emezue, Handel, Emezue, Gloria Monica Tobechukwu, Emezue, Nefertiti Nneoma, Ogun, Sewade, Akinremi, Bunmi, Adelani, David Ifeoluwa, Pal, Chris

arXiv.org Artificial Intelligence

The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for roughly one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
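For readers unfamiliar with the kind of finetuning experiment mentioned above, the sketch below adapts a pretrained Whisper checkpoint to new (audio, transcript) pairs with Hugging Face Transformers. It is a generic, assumption-level sketch and not the NaijaVoices training recipe; the checkpoint name, learning rate, and single-example step are placeholders.

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(audio_array, transcript, sampling_rate=16000):
    # One gradient step on a single (audio, text) pair; real runs would batch and pad.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    out = model(input_features=inputs.input_features, labels=labels)  # returns cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

WER would then be measured on held-out speech before and after finetuning to quantify the kind of improvements reported in the abstract.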


Brain-tuned Speech Models Better Reflect Speech Processing Stages in the Brain

Moussa, Omer, Toneva, Mariya

arXiv.org Artificial Intelligence

Pretrained self-supervised speech models excel in speech tasks but do not reflect the hierarchy of human speech processing, as they encode rich semantics in middle layers and poor semantics in late layers. Recent work showed that brain-tuning (fine-tuning models using human brain recordings) improves speech models' semantic understanding. Here, we examine how well brain-tuned models further reflect the brain's intermediate stages of speech processing. We find that late layers of brain-tuned models substantially improve over pretrained models in their alignment with semantic language regions. Further layer-wise probing reveals that early layers remain dedicated to low-level acoustic features, while late layers become the best at complex high-level tasks. These findings show that brain-tuned models not only perform better but also exhibit well-defined hierarchical processing, progressing from acoustic to semantic representations, making them better model organisms for human speech processing.
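The layer-wise probing mentioned above can be pictured with a short, generic sketch: extract each layer's hidden states from a pretrained speech model, fit a simple linear probe per layer, and compare probe scores across depth. This is an illustrative assumption only; the model choice (wav2vec2), the ridge probe, and the layer_probe_scores helper are not from the paper, whose brain-tuning and alignment analyses go well beyond this.

import numpy as np
import torch
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layer_probe_scores(waveforms, targets, sampling_rate=16000):
    # Probe each layer's mean-pooled features against one target value per utterance.
    per_layer_feats = None
    with torch.no_grad():
        for wav in waveforms:
            inputs = extractor(wav, sampling_rate=sampling_rate, return_tensors="pt")
            hidden = model(inputs.input_values, output_hidden_states=True).hidden_states
            pooled = [h.mean(dim=1).squeeze(0).numpy() for h in hidden]  # one vector per layer
            if per_layer_feats is None:
                per_layer_feats = [[] for _ in pooled]
            for layer, vec in enumerate(pooled):
                per_layer_feats[layer].append(vec)
    scores = []
    for feats in per_layer_feats:
        X, y = np.stack(feats), np.asarray(targets)
        scores.append(cross_val_score(Ridge(alpha=1.0), X, y, cv=3).mean())
    return scores  # one probe score per layer; higher = target better decodable there

The function assumes at least three utterances (for the 3-fold cross-validation) and a continuous target per utterance, e.g. an acoustic or semantic feature to be decoded.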


The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

Gao, Ming, Wu, Shilong, Chen, Hang, Du, Jun, Lee, Chin-Hui, Watanabe, Shinji, Chen, Jingdong, Siniscalchi, Sabato Marco, Scharenborg, Odette

arXiv.org Artificial Intelligence

Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating the video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
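As a pointer to how the error rates quoted above are defined, the small example below computes a character error rate as Levenshtein edit distance over reference length; DER and cpCER follow the same error-over-reference principle, with time-based and permutation-based accounting respectively. This is an illustrative sketch, not the challenge's official scoring code.

def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming over a single rolling row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (or match)
            prev = cur
    return dp[-1]

def cer(reference, hypothesis):
    # Character error rate = character edits / reference characters.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(round(cer("misp challenge", "misp chalenge"), 3))  # 1 error / 14 reference chars ~ 0.071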